Week 9.2 - Three Categories of Failure

🎯 What We'll Cover

If “what AI can't do” is a moving target, you need a way of organising failures that doesn't go stale every time a new model drops. This sub-lesson offers one: three categories that classify failures by how durable they are, not by which specific model exhibited them.

The pay-off: when you encounter an AI failure in your own work, you can place it in the taxonomy. Failures in category (a) are diagnostic of model age — switch to a current frontier model and try again. Failures in category (b) require workflow design to manage. Failures in category (c) are structural and won't be fixed by the next release; they need verification protocols (Sub-Lesson 9.5).

✅ (a) Patched / Largely Solved

These were widely-cited failure modes in 2023–24 that have been substantially mitigated by current training pipelines. The original papers documenting them remain valuable as historical landmarks — but they no longer describe current frontier behaviour. Citing them as if they did is a common error in the recent AI literature.

The Reversal Curse

In 2023, Berglund et al. showed that LLMs trained on facts of the form “A is B” failed to generalise to “B is A” unless the reversed statement also appeared in their training data. A model that knew “Tom Cruise's mother is Mary Lee Pfeiffer” would not reliably answer “Who is Mary Lee Pfeiffer's son?”

By 2026, this failure mode is no longer prominent in benchmark discussions. Training-data interventions and inference-time techniques have substantially mitigated it in frontier models. It is still cited in the literature, but largely as a historical landmark.

Original: Berglund, L., Tong, M., Kaufmann, M., et al. (2023). The Reversal Curse. arXiv:2309.12288

Basic Arithmetic and Short-Form Reasoning

In 2023, Frieder et al. tested ChatGPT and GPT-4 on graduate-level mathematics and found them mostly inadequate. The paper became a touchstone for “LLMs can't do real maths” arguments. By May 2026 this is thoroughly superseded: GPT-5.5 Pro scores 52.4% on FrontierMath (research-level problems set by Fields medallists), Gemini Deep Think scored 35/42 at IMO 2025, and several Erdős problems have been solved with frontier-model assistance (see 9.3).

Original: Frieder, S., Pinchetti, L., et al. (2023). Mathematical Capabilities of ChatGPT. NeurIPS 2023

Hallucinated Code on Common Tasks

In 2023, asking a model for code in a popular library would frequently produce calls to functions that didn't exist. By 2026, frontier models score above 87% on SWE-bench Verified, a benchmark built from real GitHub issues. Hallucinated function signatures have been largely mitigated for well-documented libraries; they persist for niche or proprietary APIs (which moves that case into category b).

📚 The Skill Being Taught

When reading a 2023–24 paper that says “AI cannot do X”, ask: has the same X been retested with current frontier models? If yes, what did the retest find? If no, the claim is unproven on current systems and should be cited cautiously, with the model and date explicit.

🛡️ (b) Reduced but Persistent

These failures have been mitigated but still surface, especially under specific conditions. Frontier models behave better than 2023–era models, but not so well that you can stop checking. These are the failure modes most relevant to research workflows in 2026.

Hallucinated Citations

Even frontier models in 2026 hallucinate citations on niche topics. The Week 5 evidence remains directly relevant:

Magesh et al. (2024). Hallucination-Free? arXiv:2405.20362. Stanford HAI. Lexis+ AI, Westlaw, and Ask Practical Law: 17–33% hallucination on benchmarking queries — despite vendor marketing as “hallucination-free”.
Chelli et al. (2024). JMIR 26: e53164. Hallucinated-reference rates: GPT-3.5 39.6%, GPT-4 28.6%, Bard 91.4%.
Linardon et al. (2025). JMIR Mental Health 12: e80371. GPT-4o: 19.9% overall; 28–29% on niche topics; 6% on well-studied. Citation hallucination correlates with how rare the topic is in training data.

In 2026 these rates have dropped further on common topics but the long-tail problem persists: the rarer a topic is in training data, the more likely AI is to hallucinate when asked about it. For research that touches the long tail by definition, citation verification is non-negotiable.

A vivid current data point: Ant Group's Ling 2.6 1T (April 2026), a 1-trillion-parameter open-weights model targeted at cost-efficient inference, scores around 34 on Artificial Analysis's Intelligence Index but reports a 92% hallucination rate on the AA-Omniscience benchmark (Artificial Analysis; methodology paper arXiv:2511.13029). The hallucination problem is unevenly addressed across the open-weights frontier; not every recent release inherits the mitigations from the leading labs.

Sycophancy

Sharma et al. (2023) documented that RLHF-trained models tend to agree with users even when the user is wrong — a behaviour driven in part by human preference judgements favouring agreement. Anthropic and other labs have invested heavily in mitigations:

Constitutional AI trains the model to evaluate its own outputs against principles including honesty and calibrated confidence (Week 2).
Direct Preference Optimization (DPO) with sycophancy-labelled preference pairs directly penalises sycophantic responses while preserving instruction-following.
Persona-vector interpretability methods extract activation patterns associated with traits like sycophancy and intervene at inference time.

Sycophancy is reduced in frontier 2026 models — but it reappears under specific conditions: long conversations, emotional framing, expert-claimed user identities, and pushback against initial AI outputs. If you want a frontier model to tell you it's wrong, you need to ask in a way that doesn't signal what answer you want.

📝 A May 2026 anecdote: model overconfidence on a maths disagreement

An informal but instructive comparison from researcher @giffmana: presented with a pair of mathematical formulas that appeared to disagree, three frontier models behaved very differently. Claude Opus 4.6 confidently defended a wrong proof and resisted correction even with pushback. ChatGPT Pro reconciled the two formulas correctly but offered the result without much interpretation. Muse Spark did both — reconciled the formulas and explained why the apparent disagreement dissolved.

This is anecdotal: a single case, not a benchmark. But it usefully illustrates that overconfident defence of wrong reasoning, a calibration failure closely related to sycophancy, still surfaces in mid-2026 frontier models, and that it is not uniformly distributed across the frontier. To detect it in your own work, ask the same question of multiple frontier models and notice where they disagree (cross-model triangulation, Sub-Lesson 9.5).
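One way to make cross-model triangulation concrete is sketched below. It assumes you have already collected each model's final answer by hand; the model labels, the `triangulate` and `needs_human_check` helpers, and the whitespace-and-case normalisation are all illustrative assumptions, not part of any vendor API.

```python
from collections import defaultdict


def triangulate(answers: dict[str, str]) -> dict[str, list[str]]:
    """Group model names by their normalised final answer.

    `answers` maps a model label (hypothetical) to the answer text you
    copied from that model. Normalisation here is deliberately crude:
    lower-case and collapse whitespace.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for model, answer in answers.items():
        key = " ".join(answer.lower().split())
        groups[key].append(model)
    return dict(groups)


def needs_human_check(answers: dict[str, str]) -> bool:
    """Flag the question for manual review when the models disagree."""
    return len(triangulate(answers)) > 1


# Hypothetical answers collected by hand from three models.
collected = {
    "model_a": "The two formulas agree",
    "model_b": "the two formulas  agree",
    "model_c": "Formula 2 is wrong",
}
print(triangulate(collected))
print(needs_human_check(collected))
```

Disagreement is the useful signal here. Agreement between models is weaker evidence, since models can share the same blind spot, so it does not replace verification.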

Original sycophancy paper: Sharma, M., Tong, M., Korbak, T., et al. (2023). Towards Understanding Sycophancy in Language Models. ICLR 2024 / arXiv:2310.13548

Calibration

Frontier models are still poorly calibrated, and most often over-confident: the probability they assign to an answer being correct does not reliably match how often they are actually right. Calibration has improved over time, but a 90%-confident answer may be right 75% of the time, while a 60%-confident answer may be right 80% of the time. This makes confidence-thresholded workflows less useful than you might hope: you can't simply “ask the model only when it's sure”.
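If you have a set of questions with known answers, you can measure this mismatch yourself. The sketch below bins stated confidences and compares each bin's mean confidence with its observed accuracy, then summarises the gap as an expected calibration error. The function names, the bin count, and the toy data (chosen to mirror the illustrative 90%/75% and 60%/80% figures above) are assumptions for illustration only.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Return (mean confidence, observed accuracy, count) per non-empty bin.

    confidences: stated probabilities in [0, 1]
    correct:     1 if the answer was right, 0 otherwise
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        rows.append((mean_conf, accuracy, len(members)))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average of |confidence - accuracy| across bins."""
    total = sum(n for _, _, n in rows)
    return sum(n / total * abs(conf - acc) for conf, acc, n in rows)


# Toy data: four answers at 90% confidence (3 right), five at 60% (4 right).
confs = [0.9] * 4 + [0.6] * 5
right = [1, 1, 1, 0] + [1, 1, 1, 1, 0]
rows = calibration_table(confs, right)
for mean_conf, accuracy, n in rows:
    print(f"stated {mean_conf:.2f}  observed {accuracy:.2f}  n={n}")
print(f"ECE = {expected_calibration_error(rows):.3f}")
```

With only a handful of answers per bin the estimates are noisy, so treat a table like this as a prompt for closer checking rather than a measurement.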

🧮 (c) Structural and Likely Persistent

These failures follow from how LLMs work, not from training-data quirks. They are unlikely to be solved by the next model release because they are properties of the architecture or the training paradigm. Verification protocols (Sub-Lesson 9.5) target these specifically.

Hallucination as Statistical Pressure

OpenAI's own analysis: hallucinations originate as binary-classification errors and persist because evaluations reward confident guessing over acknowledging uncertainty. The proposed fix is socio-technical — change how benchmarks score uncertainty — not architectural.

The argument: “Hallucinations need not be mysterious — they originate simply as errors in binary classification.”

Source: Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (Sept 2025). Why Language Models Hallucinate. arXiv:2509.04664
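The incentive behind this argument can be shown with a few lines of arithmetic. The sketch below compares the expected score of answering against abstaining (which scores 0) under two grading schemes: plain right-or-wrong grading, and grading that subtracts points for wrong answers. The function names and the penalty values are illustrative assumptions, not the paper's notation.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct: float, wrong_penalty: float) -> str:
    """Answer only if doing so beats the 0 points earned by abstaining."""
    return "answer" if expected_score(p_correct, wrong_penalty) > 0 else "abstain"


def penalty_for_threshold(t: float) -> float:
    """Penalty that makes answering worthwhile only above confidence t."""
    return t / (1.0 - t)


# Binary grading (no penalty): even a 10%-confident guess beats abstaining.
print(best_action(0.10, wrong_penalty=0.0))

# Penalised grading tuned to a 75% threshold: low-confidence guesses now lose.
penalty = penalty_for_threshold(0.75)
print(best_action(0.30, penalty))
print(best_action(0.80, penalty))
```

Under the first scheme a model that always guesses outscores one that admits uncertainty, which is the pressure the section describes. Under the second, abstaining becomes the better policy below the chosen confidence threshold.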

Pattern Completion vs Understanding

LLMs predict next tokens. Whether this constitutes “understanding” in any deeper sense is a philosophical question, but the practical implication is concrete: a model can produce correct output via shortcuts that don't generalise. This is the foundation for the illusions of understanding argument we develop in Sub-Lesson 9.4 (Messeri & Crockett, Nature 2024).

The Long-Tail Problem

Performance degrades on rare topics regardless of model scale. Frontier models in 2026 are excellent on widely-discussed topics and worse on niche ones — exactly where novel research happens. Niimi (2025), cited in Week 5, demonstrated that bibliographic hallucination rates correlate strongly with how often a paper appears in training data. This effect compounds at the frontier of any field.

Compositional Brittleness

This is the “silent error” theme from Week 7: each step of an AI-generated analysis can be plausible while the end result is wrong. Making each step more accurate helps but does not solve the problem, because per-step error rates compound multiplicatively along a chain. Long agentic chains amplify the problem.
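The multiplicative effect is easy to check numerically. The sketch below assumes, as a simplification, that steps fail independently and that any single wrong step spoils the final result; the function names and example figures are illustrative.

```python
def chain_success(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that every step in an n-step chain is correct."""
    return per_step_accuracy ** n_steps


def required_step_accuracy(target: float, n_steps: int) -> float:
    """Per-step accuracy needed to reach a target end-to-end success rate."""
    return target ** (1.0 / n_steps)


for steps in (5, 20, 50):
    print(steps, round(chain_success(0.95, steps), 3))

# Per-step accuracy needed for a 20-step chain to succeed 90% of the time.
print(round(required_step_accuracy(0.90, 20), 4))
```

Under these assumptions, a step that is right 95% of the time yields a 20-step chain that is right only about 36% of the time, and a 90% end-to-end target over 20 steps needs roughly 99.5% accuracy at every step. Real errors are not independent, so the exact numbers will differ, but the direction of the effect is the same.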

Concrete current example: FoodTruck Bench (2026) tests real-world multi-step agentic tasks. Even DeepSeek V4 Pro — competitive with the closed frontier on most single-step benchmarks — struggles substantially. Individual capabilities (writing code, calling tools, parsing documents) keep improving; chaining them reliably remains hard. This is the failure mode that scales worst as agentic systems get longer.

Domain-Specific Failure Modes

Even frontier models fail differently in different fields, regardless of overall benchmark score. A model excellent at theoretical physics may be brittle on Renaissance history. Aggregate benchmarks don't predict performance in your specific domain. This is why the hands-on activities (9.6) ask you to test capability in your own field rather than rely on published numbers.

Training Data Dependence

Models can only know what was in their training data. They will confidently answer questions about events after their training cutoff using outdated information. They may have been trained on a biased sample of the literature in your field. This dependence is structural — it doesn't go away with more parameters or better RLHF.

📖 Background Reading

Kapoor, S., & Narayanan, A. (ongoing). normaltech.ai — the blog formerly known as AI Snake Oil. The structural-failures position; argues that many AI capability claims are overstated and many failures are structural rather than version-specific. The 2024 Princeton UP book AI Snake Oil is a longer version of the argument.

Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (Sept 2025). Why Language Models Hallucinate. arXiv:2509.04664 — OpenAI's technical paper on the structural origins of hallucination. Reading both Kapoor & Narayanan and Kalai et al. gives you views from outside and inside the labs respectively.

🧾 Putting the Taxonomy to Work

Here's how the taxonomy plays out for failures you might actually encounter in research workflows. When you observe a failure, locate it in this table:

Observed Failure	Category	Diagnosis & Action
Model fails on simple arithmetic in your prompt	Patched	You're using an old model. Switch to a current frontier model (Opus 4.7, GPT-5.5, etc.) and try again.
Model produces a confident citation that doesn't exist on a niche topic	Reduced but persistent	Run the Five-Point Citation Check from Week 5. Don't skip verification because the model sounded confident.
Model agrees with your hypothesis after you express attachment to it	Reduced but persistent	Sycophancy. Re-prompt without signalling your preferred answer; cross-check with a different model.
Model produces code that runs perfectly but gives wrong numerical results	Structural (compositional)	The Week 7 silent-error problem. Verification with known-answer testing is mandatory.
Model gives you summary statistics on a topic that look reasonable but on inspection are wrong	Structural (long tail)	You've hit the long-tail problem. Verify against primary sources; consider whether the model's training data covers your topic adequately.
Model is excellent in one of your domains and brittle in another	Structural (domain-specific)	Test capability in each domain you use AI in. Don't generalise from one domain's reliability to another's.
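For the silent-error row, known-answer testing can be as simple as the sketch below: run the generated code on inputs whose correct output you have worked out by hand, and compare. Here `ai_generated_variance` is a hypothetical stand-in for model-written code, and the sketch assumes the analysis called for the sample variance (dividing by n - 1).

```python
import statistics


def ai_generated_variance(xs):
    """Hypothetical model-written code: runs cleanly but divides by n, not n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def known_answer_check(fn, cases, tol=1e-9):
    """Return (inputs, expected, got) for every case where fn misses the known answer."""
    failures = []
    for inputs, expected in cases:
        got = fn(inputs)
        if abs(got - expected) > tol:
            failures.append((inputs, expected, got))
    return failures


# Hand-computed case: mean 5, squared deviations sum to 32, so sample variance = 32 / 7.
CASES = [([2, 4, 4, 4, 5, 5, 7, 9], 32 / 7)]

print(known_answer_check(ai_generated_variance, CASES))  # non-empty: silent error caught
print(known_answer_check(statistics.variance, CASES))    # empty: reference passes
```

The generated function raises no exception and returns a plausible number, which is why only a comparison against an independently known answer exposes the problem.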

The skill is the taxonomy, not the specific examples

The specific examples in this sub-lesson will date. The reversal curse will eventually be a footnote that no one cites. New patched failures will replace it. The structural-failures list may grow or shrink as research clarifies what is fundamental and what is not.

What persists is the question you ask when you encounter a failure: is this patched, reduced-but-persistent, or structural? The answer determines what you do about it.

👉 What Comes Next

Sub-Lesson 9.3 — Where AI Is Now Genuinely Strong. Having looked at the trajectory and the failure taxonomy, we now look at the other side: what current frontier models are genuinely good at, with concrete examples from mathematics and theoretical physics in 2026. Many students underestimate AI in some directions; recalibration is the work.